Efficient SQL-Querying Method for Data Mining in Large Data Bases

Author

  • Hung Son Nguyen
Abstract

Data mining can be understood as the process of extracting knowledge hidden in very large data sets. Data mining techniques (e.g. discretization or decision trees) are often based on searching for an optimal partition of data with respect to some optimization criterion. In this paper, we investigate the problem of optimal binary partition of a continuous attribute domain for large data sets stored in relational databases (RDB). The critical factor for the time complexity of algorithms solving this problem is the number of simple SQL queries of the form SELECT COUNT FROM ... WHERE attribute BETWEEN ... (related to some interval of attribute values) necessary to construct such partitions. We assume that the answer time for such queries does not depend on the interval length. Using the straightforward approach to optimal partition selection (with respect to a given measure), the number of necessary queries is of order O(N), where N is the number of pre-assumed partitions of the search space. We show some properties of the considered optimization measures that allow the size of the search space to be reduced. Moreover, we prove that using only O(log N) simple queries, one can construct a partition very close to optimal.
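As a rough illustration of the straightforward O(N) baseline mentioned in the abstract, the sketch below counts how many COUNT ... BETWEEN queries are issued when every candidate cut is evaluated. It is not the paper's algorithm: the table name `data`, the columns `a` and `cls`, the helper names, and the scoring function (number of object pairs from different classes separated by the cut) are all assumptions chosen for the example, and SQLite stands in for the relational database.

```python
import sqlite3


class QueryCounter:
    """Wraps a connection and counts the simple COUNT ... BETWEEN queries issued."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def count_between(self, lo, hi, cls):
        # One "simple query": objects of class `cls` with attribute a in [lo, hi].
        self.queries += 1
        cur = self.conn.execute(
            "SELECT COUNT(*) FROM data WHERE a BETWEEN ? AND ? AND cls = ?",
            (lo, hi, cls),
        )
        return cur.fetchone()[0]


def pairs_separated(left, right):
    # Assumed measure: pairs of objects from different classes split by the cut.
    total = sum(left) * sum(right)
    same_class = sum(l * r for l, r in zip(left, right))
    return total - same_class


def best_cut_straightforward(counter, cuts, lo, hi, classes):
    """Evaluate every candidate cut; query count grows linearly with len(cuts)."""
    totals = [counter.count_between(lo, hi, k) for k in classes]
    best_cut, best_score = None, -1
    for c in cuts:
        # Cuts are assumed to lie strictly between data values, so the
        # inclusive BETWEEN never counts an object on both sides.
        left = [counter.count_between(lo, c, k) for k in classes]
        right = [t - l for t, l in zip(totals, left)]
        score = pairs_separated(left, right)
        if score > best_score:
            best_cut, best_score = c, score
    return best_cut, best_score


# Hypothetical toy data: attribute value and class label.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (a REAL, cls INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)],
)
counter = QueryCounter(conn)
cuts = [1.5, 2.5, 3.5, 4.5, 5.5]
print(best_cut_straightforward(counter, cuts, 0, 7, [0, 1]), counter.queries)
```

With N candidate cuts and d classes this baseline issues (N + 1) * d queries, which is the linear cost the abstract's O(log N) result is meant to avoid.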


Similar resources

Querying Hierarchical Data in Very Large Databases

Hierarchical data, such as partially ordered sets (POSETs), is widely used in relational databases, especially in data mining and data warehouse-based applications. Unfortunately, SQL (Structured Query Language) does not effectively support hierarchical data structures for managing this sort of data. For example, in Oracle, a CONNECT BY operator is used to query data organized into trees, howeve...


Preparing Data Sets for the Data Mining Analysis using the Most Efficient Horizontal Aggregation Method in SQL

A great deal of time is needed to build the data set for data mining analysis, because data mining practitioners are required to write complex SQL queries and join many tables to get the aggregated result. Traditional SQL aggregations prepare data sets in a vertical layout; that is, they return one column per aggregated group. But for the data mining project, the dat...


SQL based frequent pattern mining

Data mining on large relational databases has gained popularity and its significance is well recognized. However, the performance of SQL-based data mining is known to fall behind specialized implementations because of the prohibitive cost associated with extracting knowledge, as well as the lack of suitable declarative query language support. Frequent pattern mining is a foundation of s...


Improving Analysis Of Data Mining By Creating Dataset Using Sql Aggregations

In data mining, an important goal is to generate efficient data. Efficiency and scalability have always been important concerns in the field of data mining. The increased complexity of the task calls for algorithms that are inherently more expensive. To analyze data efficiently, data mining systems widely use data sets with columns in a horizontal tabular layout. Preparing a data set is mor...


Caching for Multi-dimensional Data Mining Queries

Multi-dimensional data analysis and online analytical processing are standard querying techniques applied on today’s data warehouses. Data mining algorithms, on the other hand, are still mostly run in stand-alone, batch mode on flat files extracted from relational databases. In this paper we propose a general querying model combining the power of relational databases, SQL, multidimensional quer...




Publication date: 1999